
Conversation

@novatechflow
Member

Summary

  • Add Spark Dataset/DataFrame plumbing: Parquet source/sink flag, channel conversions, optimizer cost hints.
  • Document how to build dataset-backed pipelines (README.md, guides/spark-datasets.md).

Next steps / follow-ups

  • ML4All pipelines still emit/consume raw double[]/Double RDDs. We should extend them to use DatasetChannels once schema handling is in place.
  • Text/Object sources currently produce RDD channels. A Record-backed variant (or a conversion helper) would allow dataset output without extra user code.
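The second follow-up could look roughly like the sketch below: a hypothetical conversion helper that wraps raw double[] rows in a positional Record so an RDD-producing source could feed a dataset channel without extra user code. All names here (Record, toRecords) are illustrative stand-ins, not the actual Wayang API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: a tiny stand-in for a Record-backed conversion helper. */
public class RecordAdapter {

    /** Minimal stand-in for a positional record of field values. */
    public static final class Record {
        private final Object[] values;

        public Record(Object... values) {
            this.values = values;
        }

        public Object field(int i) {
            return values[i];
        }

        public int arity() {
            return values.length;
        }
    }

    /** Wraps each raw double[] row in a Record, as a conversion helper might. */
    public static List<Record> toRecords(List<double[]> rows) {
        return rows.stream()
                .map(row -> new Record(Arrays.stream(row).boxed().toArray()))
                .collect(Collectors.toList());
    }
}
```

A helper like this would let the existing Text/Object sources stay RDD-based while a downstream adapter produces the schema-carrying records a dataset channel needs.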

juripetersen previously approved these changes Jan 7, 2026
@novatechflow
Member Author

Fixed the overload issue. Java does not see the Scala overload. The build runs through now.

juripetersen previously approved these changes Jan 7, 2026
@novatechflow
Member Author

novatechflow commented Jan 7, 2026

This is likely a Scala version mismatch: scala.annotation.JvmOverloads isn’t available in the Scala version used by CI (likely 2.11), while my local build uses Scala 2.12.17, which is why it compiled locally but fails on CI.

mvn -q -Dexpression=scala.version help:evaluate -DforceStdout
2.12.17

mvn -pl wayang-api/wayang-api-scala-java -am -DskipTests compile
[INFO] BUILD SUCCESS
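One alternative that sidesteps the annotation entirely, and is portable across Scala versions, is to declare explicit overloads so that Java callers see each arity directly (Java cannot see Scala default parameters). A minimal, self-contained Java sketch of the pattern; the class and parameter names are illustrative, not taken from the PR:

```java
/**
 * Illustrative only: explicit overloads make default-style arguments
 * visible to Java callers, independent of the Scala version in use.
 */
public class ParquetSource {
    private final String path;
    private final String[] projection;

    public ParquetSource(String path, String[] projection) {
        this.path = path;
        this.projection = projection;
    }

    // Overload standing in for a default argument (projection = null).
    public ParquetSource(String path) {
        this(path, null);
    }

    public String describe() {
        return projection == null
                ? "parquet(" + path + ")"
                : "parquet(" + path + ", " + projection.length + " cols)";
    }
}
```

On the Scala side the equivalent is simply writing the extra overloads by hand instead of relying on default parameters, which keeps the API callable from Java on both 2.11 and 2.12.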

@novatechflow novatechflow merged commit e6ce5a9 into apache:main Jan 7, 2026
4 checks passed
@novatechflow novatechflow deleted the feature/spark-dataframes branch January 7, 2026 08:52